Abstract


For decades, standardized reading comprehension tests have consisted of a series of passages and 
associated multiple-choice questions. Although such tests are widely used in and out of the classroom, there
continues to be considerable disagreement about whether and how they have net value in
the service of advancing educational progress in reading. This chapter begins with a review of 
features that characterize standardized reading assessments. In particular, we discuss how 
assessment designs and analytics reflect a balance of practical and measurement constraints. We 
then discuss how advances in the learning sciences, measurement, and electronic technologies 
have opened up the design space for a new generation of reading assessments. Abstracting from 
this review, we end by presenting some examples of prototype assessments that reflect
opportunities for enhancing the value and utility of reading assessments in the future.


If frequency and time spent administering assessments to students were criteria of 
success, then the current era in U.S. schooling could be considered a golden age of testing. For 
example, a recent report from the American Federation of Teachers (Nelson, 2013) provided 
some staggering facts about the volume of testing in two school districts in the U.S. One district
had 34 test administrations and the other as many as 47. This translated to anywhere from
three full school days to nearly two weeks of time dedicated to testing. Test preparation time 
varied from 16 full school days to approximately a month. If this study is even marginally 
representative of schools across the country, there is no shortage of testing in our schools. 

Despite the ubiquity of assessments in schools, their abundance and increasing prevalence
are not universally lauded, especially when the stakes are considered high
(Minarechova, 2012). Even before the era of No Child Left Behind, researchers argued
over the amount of high stakes testing and its effect on driving the curriculum (Neill, 1997).
High stakes tests have been criticized as negatively affecting construct validity, increasing
corruption and cheating, and influencing how cut score decisions are made (Berliner, 2011;
Petress, 2006). These effects seem more pronounced in years and grades where high stakes tests 
are administered, as compared to grades and years in which they are not administered (Stecher & 
Barron, 2001), indicating the testing is driving the effects. High stakes testing also affects 
instruction time. After high-stakes testing is introduced, instructional time for the subjects that 
are tested (e.g., in English Language Arts (ELA) or mathematics) increases (Au, 2007). 
However, this comes at a cost to the instructional time devoted to other subjects that are not the 
focus of the high stakes testing, e.g., social studies (McMurrer, 2008). Clearly, high stakes
testing has an impact on education, the curriculum, and instructional time.


The negative reaction to high stakes testing is not limited to academics and educators, but 
has spread to the general public as well. For instance, a recent poll of registered voters in New 
York showed that 52% of respondents indicated there is too much testing, while only 12% 
indicated there is not enough (Siena College Research Institute, 2013). This negative view on 
testing has led New York to consider revising its testing policies (Spector, 2013). At a
national level, public opinion towards the Common Core State Standards and testing has even
caused a “Don't Send Your Child To School Day” movement (Owens, 2013). Clearly, at both the
public and academic levels, there is broad concern about the amount, type, and use of testing in
schools. 

If assessments are to be useful for improving learning in applied contexts (such as 
improving comprehension in middle and high school students), then the science of assessment 
needs to respond to the critiques with solutions other than simply more types of tests, more 
frequently administered (Gordon Commission, 2013). Opportunely, the convergence of 
educational policy, the use of electronic technologies, empirical and theoretical research on 
comprehension, and advances in measurement theory in the 21st century provides a unique
context for revisiting the traditional design and measurement techniques characteristic of literacy
assessments (Sabatini, Albro, & O’Reilly, 2012; Sabatini, O'Reilly, & Albro, 2012).

In this chapter, we review and present ideas regarding the process of assessment 
construction. We discuss theoretical frameworks and principles used to structure assessments and 
guide item development, as well as psychometric models used to estimate scores. We begin with 
a selective review of some tenets that typify the state-of-the-art of standardized
comprehension tests, highlighting strengths and weaknesses that create opportunities and
challenges. We then discuss the future of comprehension assessments and some ideas for
optimizing their use in enhancing learning and achievement in middle and high school students.


Modern Standardized Comprehension Testing 


In this section, we describe some foundational concepts underlying canonical, 
standardized reading assessment designs that are in use today. An examination of these concepts 
can help us to understand which design elements serve or satisfy which content, use, or 
measurement purposes or constraints. This section can then serve as a preface for exploring the
possibilities and consequences of innovating in assessment design and measurement models.


Assessments Reflect a Balance of Constraints 

The perspective that we take is that both the form and the utility of assessments are a 
function of how well the design addresses and balances the multiple constraints that need to be 
considered in light of the purpose, use, and interests of stakeholders. While the effects of testing 
have been well documented over the past 100 years (Phelps, 2012), modern standardized tests 
represent years of optimizing the trade-offs between various technical and practical constraints 
imposed on design and statistical modeling.¹ It is beyond the scope of this chapter to address
every key concept. Instead, we focus on the following design and implementation concepts: a)
the construct; b) standardization; and c) cost and time efficiency. We then address the following
psychometrics concepts: a) classical test theory; b) unidimensionality and item independence; c)
reliability; and d) validity. Below, we introduce each very briefly, then discuss a number of
constraints that arise from traditional definitions or techniques used to operationalize the
concepts in testing. In the subsequent section, we will introduce advances that are changing the
landscape of limits and constraints in designing innovative assessments of comprehension.

¹ For those interested in a more complete and technically sophisticated treatment of measurement concepts, issues of
ethical design and use, and modern day advances, a library of measurement books is available (e.g., see
AERA/APA/NCME, 1999; Brennan, 2006; ETS, 2002).


Design and Implementation 

Defining and measuring the construct of reading comprehension. There is no 
universally agreed upon, single theory of comprehension, and therefore, by implication, no 
unified reading comprehension construct definition (Cain & Parrila, 2014; Perfetti & Stafura, 
2014). What is largely agreed upon is that the cognitive knowledge, skills, and dispositions that 
comprise an individual’s proficiency in comprehension are invisible (unobservable or latent, as
some measurement specialists prefer to say). We can only infer their presence from evidence 
collected as individuals perform comprehension tasks. A reading assessment is generally a 
collection of tasks (texts plus questions about those texts); the examinee’s responses are the 
evidence. One of the primary challenges in assessment design is in defining the target construct, 
choosing tasks that represent that construct definition, and evaluating the evidence trail those 
tasks produce. 

One aim of a strong assessment design is to measure broadly the target construct. The 
intent of broad construct coverage is to enhance the validity of the inference that an examinee (or 
group in some cases) possesses the knowledge and skills representative of proficiency in the 
target domain. Breadth of coverage would seem to increase the generalizability of the inference 
from observed performances to the construct. One would like to make a claim about an 
individual’s (or group’s) general ability in, for instance, reading comprehension, and not merely
a claim that on a specific day the individual was able to read specific passages and answer
specific questions.


As in other applied statistical sampling situations, the notion is that one defines the scope 
of the construct domain, usually categorized across several dimensions, then samples 
systematically across that domain to obtain a reliable estimate of an individual’s ability. In 
reading, this typically has taken the form of a two dimensional matrix: the first dimension 
consisting of the spectrum of text types an individual might encounter; the second consisting of 
the skills that one is likely to apply when comprehending those texts. Curriculum skill standards 
can be used to describe priorities for instruction and learning within this construct space; thus,
they often weigh heavily in constructing the matrix of valued knowledge and skills. 
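
As a rough illustration of the two-dimensional matrix described above, the following Python sketch tallies how many items fall into each cell of a text type by skill blueprint. The category labels, the item tags, the `thin_cells` helper, and the minimum of two items per cell are all hypothetical and chosen only for illustration; they are not drawn from any particular assessment framework.

```python
from collections import Counter
from itertools import product

# Hypothetical blueprint dimensions (illustrative labels only).
TEXT_TYPES = ["literary", "informational", "persuasive"]
SKILLS = ["locate", "infer", "evaluate"]

# Hypothetical item pool: each item is tagged with one (text type, skill) cell.
items = [
    ("literary", "locate"), ("literary", "infer"), ("literary", "infer"),
    ("informational", "locate"), ("informational", "locate"),
    ("informational", "infer"), ("informational", "evaluate"),
    ("persuasive", "evaluate"),
]


def coverage(item_tags):
    """Count items in every cell of the blueprint, including empty cells."""
    counts = Counter(item_tags)
    return {cell: counts.get(cell, 0) for cell in product(TEXT_TYPES, SKILLS)}


def thin_cells(item_tags, minimum=2):
    """Return cells holding fewer than `minimum` items (an arbitrary threshold)."""
    return [cell for cell, n in coverage(item_tags).items() if n < minimum]


blueprint = coverage(items)
for text_type in TEXT_TYPES:
    row = [blueprint[(text_type, skill)] for skill in SKILLS]
    print(f"{text_type:>13}: {row}")
print("Cells with sparse coverage:", thin_cells(items))
```

A tally of this kind makes uneven sampling visible: cells with few or no items are the ones for which subskill-level inferences would be least defensible.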

One trade-off that is often required to maximize the breadth of coverage, though, is 
depth, resulting in an assessment (or a curriculum) that is sometimes described as a “mile wide 
and an inch deep” (Schmidt, McKnight & Raizen, 1997). Depth may be interpreted to mean 
reliable estimates of subskills. If test items are widely and unevenly sampled across the domain, 
precise inferences about specific subskills are not possible. Depth can also mean engaging the 
learner in deeper, more complex reading tasks. Deeper tasks often mean permitting the student 
more time with a selected set of texts to reason, reflect, and respond to complex problems.
Asking deep questions may require more response time for a targeted set of questions,
at the expense of the broader coverage one might get from simpler questions that can be answered
quickly. For example, while one of the advantages of performance assessment is an increase
in depth of skills tested, it is often at the expense of reduced generalizability in comparison to
more traditional tests (Miller, 2002).


Standardization. Standardization concerns instantiating a test in a consistent fashion 
for all examinees. The intent of standardization of instructions, administration, and scoring is to
maximize objectiveness and comparability of scores across a population, which in turn impacts
test reliability, validity, and fairness. Non-standardized procedures increase the risk that
different individuals may have unfair advantages or disadvantages, resulting in scores that do not
reflect their true ability on the targeted construct. Standardization does not prevent bias, but at
least it systematizes it, making it easier to detect by other means -- e.g., differential item
functioning (DIF), which is used to detect items that function differently in subgroups of interest
such as gender or ethnicity (Santelices & Wilson, 2012) -- and it does preclude some kinds of
overt bias.


While beneficial, standardization when taken to the extreme may constrain the inferences 
that can be made from test scores. This can occur when key aspects of the target construct are 
not measured, because the effort to standardize the administration and scoring is high (e.g., 
training scorers to objectively score essays). By neglecting to measure parts of the construct, the
validity of the score as a measure of the construct is threatened.


Unprincipled standardization may also lead to unintended consequences. For example,
imposing time constraints in a reading comprehension test may shift the construct from 
measuring true reading ability to measuring individual differences in processing speed. 
Conversely, providing unlimited time on a measure designed with a fixed time limit (perhaps 
with the intent of taking into account variation in processing speed) would be similarly 
inappropriate. In any event, standardization involves making a set of choices that maximize the 
consistency of some administration features of the test to ensure the generalizability of the 
assessment. However, issues of construct coverage and standardization are often also balanced 
against more practical constraints, such as cost and efficiency, which are discussed next. 

Cost and efficiency. In balancing assessment design features, a practical constraint is
often defined by the cost and efficiency of the test (Peng, Li, & Wan, 2012). In practice, this has
resulted in the robust use of multiple choice items to measure reading ability (Rupp, Fern, &
Choi, 2006). The multiple-choice (MC) item format has become so widespread in standardized 
testing, perhaps, because of how it simultaneously helps to meet multiple design (and 
measurement) constraints. Often maligned and criticized, the MC format confers multiple 
benefits. MC items can be objectively and automatically scored, addressing the standardization 
constraint. Open-ended or constructed response (CR) items can also be scored objectively;
however, historically, CR items have been costly to administer (students require more time to
respond than typical MC items) and costly to score (after factoring in training and calibrating
reliable scorers). The added time required to complete CRs also affects the breadth of
construct coverage a test can accomplish.

MC items allow for more items to be administered per unit of time than many other 
alternatives, allowing wider breadth of sampling of the domain per unit time; consequently they 
are time efficient. In addition, until recently, a significant benefit of printed MC format tests was 
their cost effectiveness for large-scale, group testing. More items could be printed per page, and 
with bubble-entry answer sheets, test booklets could be reused, while answers could be scored 
automatically. The advent of computer and web-administration of tests, however, is reducing the 
need for printed tests. Consequently, this benefit is diminishing (though MC items still confer the 
benefit of efficiencies associated with adaptive testing, which will be discussed later in the 
chapter). Finally, sophisticated, yet efficient statistical techniques and theories have been 
aligned with the dichotomous item score (i.e., correct vs. incorrect).²

² It is not that psychometrics cannot handle scores other than dichotomous; however, the complexity increases and
efficiency in design and analyses typically decreases.

While MC items have many benefits, an overreliance on traditional forms of MC may
have other unintended consequences (see Rupp et al., 2006). For instance, MC items are useful
for testing recognition processes, but not the recall of information or the ability of the individual
to generate a response. In general, most applied settings of knowledge and skills do not resemble 
the context of choosing among prepared, alternative responses. Providing incorrect alternatives 
(distracters) in MC format can activate incorrect or irrelevant knowledge. Similarly, poorly 
constructed multiple choice assessments can be problematic because the correct answers can 
often be selected without reading the passages (Katz & Lautenschlager, 2001; Powers & Wilson-Leung,
1995). If poorly designed multiple choice questions can be answered without the
passage, then the validity of the test is severely threatened. 

In sum, while features that are designed to maximize efficiency and reduce costs are 
clearly important, there are trade-offs that can impact the validity of claims about individuals, 
and the utility of test results for different purposes. 

Statistics and psychometrics in testing 

A key feature of the modern standardized test is the technical, statistical machinery of 
psychometrics that has been developed to infer the quality, reliability, and validity of inferences 
from test scores. From its origins in the beginning of the 19th century through today, the
methodologies associated with test development and analyses have become ever more 
sophisticated, yet precise. In this chapter, we focus on a select set of concepts that we view as 
undergoing a shift from past practice, as innovations in measurement theory are explored and 
implemented in applied contexts. The discussion is mostly non-technical, with the focus on 
explaining concepts versus technical detail. 

Classical test theory. This theoretical approach represents the historical methodology for 
estimating the difficulty and discrimination of test items, as they appear on a specific test form.
As indicated by the name of the theory, the classical approach is focused on the nature of total
test scores, which can be expressed by the relation between an individual’s achieved total score
(X) at a given administration of the assessment, an unknown true score (T), and an unknown
error score (E): X = T + E.

As an illustration, imagine a test consisting of three reading comprehension passages and 
20 questions. If the assessment was administered each week over a period of eight weeks, the 
distribution of scores would demonstrate that at some administrations, an individual’s scores 
might be higher or lower than on other occasions. The best estimate of an individual’s ability 
would not be any of the selected administrations, but rather the average across all the individual 
total scores. Additionally, if the reading comprehension measure was assumed to have no error 
(i.e., E = 0), then the total score X would be equal to the true score T, and the total test scores for
the individuals would be considered perfectly reliable. The separation of true versus observed 
score is in recognition of the unobserved or latent nature of constructs. We infer the construct 
based on the observations we make of student behavior and these observations are not without 
error. Understanding, controlling, or minimizing the error is a large part of the technical 
expertise that goes into test design and score modeling. However, as we will see later, deciding 
what is and what is not error is not trivial and may shape the nature of the construct and the 
inferences that can be made from the scores. 
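
The eight-week illustration above can be sketched in a few lines of Python. The scores are invented for illustration, and the sketch assumes the classical decomposition X = T + E with errors that average out to zero across administrations; under that assumption the mean observed score serves as the estimate of T.

```python
from statistics import mean, stdev

# Hypothetical total scores (out of 20) for one student across eight weekly
# administrations of the same reading comprehension test.
observed_scores = [14, 16, 13, 15, 17, 14, 15, 16]

# If errors average to zero, the mean observed score estimates the true score T.
true_score_estimate = mean(observed_scores)

# Deviations from that mean stand in for the error component E on each occasion.
error_estimates = [x - true_score_estimate for x in observed_scores]

print(f"Estimated true score: {true_score_estimate:.2f}")
print(f"Spread of observed scores (SD): {stdev(observed_scores):.2f}")
print("Estimated errors by week:", error_estimates)
```

No single administration needs to equal the estimate; each observed score is treated as the same underlying T plus a different error.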

In classical test theory, two features of items are worth noting: item difficulty and level of 
discrimination. Item difficulty refers to the proportion of individuals who correctly respond to an 
item, and ranges from .00 to 1.00 with values closer to one indicating the item is easier. Item 
discrimination characterizes the strength of the relation between item and test performance, and 
in classical test theory is typically evaluated using the point-biserial, item-to-total correlation
(Nunnally & Bernstein, 1994). Values for this index range from -1.00 to 1.00; negative estimates
are not desirable as they indicate that an individual who correctly answers a question is likely to
have a low total score, and item-to-total correlations from .00 to approximately .20 reflect
nonexistent or weak associations. Taken together, items which are considered to be “good” in
classical test theory are those that do not demonstrate floor (i.e., < 5% get the item correct) or
ceiling (i.e., > 95% get the item correct) effects, and where the item-to-test correlations are at
least .20.
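
A minimal Python sketch of these two item statistics follows. The response matrix is invented, the function names are hypothetical, and the item-to-total correlation is computed against a total score that still includes the item itself (an uncorrected index, which is one common simplification rather than the only option).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def item_statistics(responses):
    """Return (difficulty, discrimination) per item for 0/1 scored responses."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        difficulty = sum(item) / len(item)      # proportion answering correctly
        discrimination = pearson(item, totals)  # point-biserial with total score
        stats.append((difficulty, discrimination))
    return stats


def is_acceptable(difficulty, discrimination):
    """Apply the rules of thumb from the text: no floor/ceiling, r of at least .20."""
    return 0.05 <= difficulty <= 0.95 and discrimination >= 0.20


# Hypothetical data: six examinees (rows) by four items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
]

stats = item_statistics(responses)
for j, (p, r) in enumerate(stats, start=1):
    print(f"item {j}: difficulty = {p:.2f}, discrimination = {r:.2f}, "
          f"acceptable = {is_acceptable(p, r)}")
```

In this toy data the first item is answered correctly by everyone, so it shows a ceiling effect and carries no information for discriminating among examinees.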

Classical test theory continues to be a commonly used framework in psychometrics. The 
advantage of classical test theory is that it is relatively simple and it accounts for item difficulty 
and discrimination parameters. However, it does not simultaneously incorporate properties of the
items and the ability of the test taker into the model, something that item response theory (IRT)
does take into consideration (Lord, 1980). For instance, in classical test theory, measurement
error is assumed to be the same for all test takers. In reality this is not true, as we discuss later.

Another set of constraints also arises from the focus of the theory on the test form, rather
than on the item level. The consequence is often that the assumptions of classical test theory only
hold when forms are administered intact (i.e., the same items in the same sequence), a challenge
when developing and validating, for example, multiple, parallel forms and adaptive testing
programs. As we will discuss, IRT helps address some of these constraints, though others 
persist, and new challenges arise that also must be addressed. 

Test unidimensionality and item independence. Two other historical, psychometric 
assumptions/constraints are unidimensionality and item independence.³ Unidimensionality refers
to the assumption that all the items on a test measure a single, unitary construct - however that
construct may be defined. So, if a test is designed to measure the construct of reading, then all
the items should measure reading, not math, or science, or geography. Complexities arise as one
considers whether sampling from different aspects of the construct constitutes other independent
constructs or dimensions. For example, statistics, geometry, and calculus could arguably be
subdimensions of a unidimensional mathematics construct, or separate, unidimensional
constructs on their own. Questions often arise concerning what is construct relevant versus
irrelevant (or error) or pre-requisite skills, as well as whether there are sufficient items to warrant
detecting psychometrically distinct subdimensions in a test. In general, exploring the
dimensionality of a test is often a key step in understanding or establishing the validity of
inferences from scores. Many options now exist for conducting dimensionality analyses, as
discussed later.

³ In psychometrics, item independence is introduced as a purely statistical assumption, though it has practical
implications for task design, as discussed later.

Item independence concerns the relationship or dependence of getting an item correct 
based on other items in the test. The goal is to be able to treat every item as a random sampling 
from the construct domain. Item dependency typically occurs when an item might provide a key 
piece of information that is necessary to answering a subsequent item, thus, changing the 
probability of the response based on what one knows or learns during the test. In a strict sense, 
item independence is almost always violated when writing multiple questions to a single text 
passage in a reading comprehension test. The individual items may not directly cue each other;
however, one’s general understanding of the passage may have an influence on the entire set of
items. Recent innovations surrounding the notion of testlets have started to provide techniques for
accounting for the variance associated with dependencies among test items (Wainer, Bradlow, & 
Wang, 2007). 

Strict adherence to item independence can result in narrowing the construct. For
example, research supports the importance of proficiency when reading in multiple text and
digital environments, where students are expected to read a set of related sources on a similar
topic (Britt & Rouet, 2012; Coiro, 2009). In this case, designs for adequately measuring the 
construct might warrant stronger item dependencies than would be deemed as appropriate under 
traditional assumptions. Fortunately, options for exploring item independence and managing 
violations are becoming available. 

In summary, dimensionality and item independence shape how a test is analyzed, 
evaluated, and interpreted. However, without appropriate reliability, a test is typically not 
considered useful for any type of reporting about examinees - an issue addressed in the next 
section. 

Reliability. Test reliability is sometimes represented in journal articles and other 
academic literature as the panacea for ensuring the technical adequacy of a test. Most statistics 
and psychometric textbooks note that test reliability is a necessary, but not sufficient,
prerequisite to validity. Like validity (discussed next), reliability is a complex technical concept that
is continually being formulated, contested, re-evaluated, and debated (Haertel, 2006). In
classical test theory, the staple techniques used to evaluate the reliability of tests have been
internal consistency (e.g., Cronbach’s alpha), retest reliability, and alternate-form reliability,
though there has been an increasing amount of criticism of Cronbach’s alpha (Sijtsma, 2009).
Each technique represents a unique history and perspectives on what aspects of reliability are 
essential, and they are not interchangeable. How reliability is conceptualized varies depending on 
whether the measurement framework is based on classical test theory or IRT (Embretson & 
Reise, 2000; Fan, 1998; Hambleton & Jones, 1993; Lord, 1980; Petscher & Schatschneider, 
2011). Thus, this section focuses on the distinctions between the two theories as they pertain to
reliability.
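
To make the internal consistency index mentioned above concrete, here is a minimal Python sketch of Cronbach's alpha using its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response matrix and the function name are hypothetical, and population variances are used throughout as a simplifying choice.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items matrix of item scores."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [
        pvariance([row[j] for row in responses]) for j in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (n_items / (n_items - 1)) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical data: five examinees (rows) by four dichotomous items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

A single coefficient like this summarizes the whole form, which is part of why it is not interchangeable with retest or alternate-form estimates.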

The basis of the classical test theory definition of reliability is the correlation of a test X
and its parallel form X'; hence, reliability is often written as ρ_XX'. A primary assumption
that follows from this definition of reliability is that the standard error of measurement (i.e., a
measure of uncertainty in a score; SEM) associated with any person’s total score is constant
across all individuals. In practice, achieving this would require strong item to ability matching
in a test form, so as to ensure that there are no floor or ceiling effects in item responses. However,
item-ability matching is quite difficult to achieve using classical test theory, because the theory is
focused on the totality of items (i.e., a total test score).
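
The constant-error assumption can be illustrated with the usual classical formula, SEM = SD of observed scores times the square root of (1 - reliability). The Python sketch below uses invented values for the standard deviation and the reliability coefficient, and the function name is hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: observed-score SD times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


# Hypothetical values: observed-score SD of 3 points and a reliability of .91.
sem = standard_error_of_measurement(score_sd=3.0, reliability=0.91)

# The same SEM is attached to every examinee, whatever their total score.
for total_score in (4, 10, 18):
    low, high = total_score - sem, total_score + sem
    print(f"score {total_score:>2}: +/- 1 SEM band = [{low:.1f}, {high:.1f}]")
```

Every examinee receives a band of identical width, including those scoring near the floor or ceiling of the form, where precision is least plausible.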

IRT models have different assumptions about errors that allow the precision (or reliability)
of scores to vary across individuals. Precision is derived from what is termed
information, a special property in item response theory that is calculated from an item’s
discrimination parameter and the probability of correctly answering an item given a person’s
ability score. The higher the discrimination parameter and the more closely matched an
individual’s ability score is to the difficulty of the item, the more information we have about the
person’s ability, and thus, the more precisely their ability is estimated. In the same way that
reliability in classical test theory is associated with measurement error, so is information in IRT
associated with a standard error of the estimate (SEE). The advantage of using information in the
context of IRT is that a more realistic estimate of the reliability of scores for all examinees can
be achieved. Despite this advantage, the algorithms used to obtain the ability scores and
SEE are mathematically complex. Thus,
the lack of transparency in estimation may produce difficulty in explaining the results and how
they were obtained to school and state officials.
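
The relation between information and precision can be sketched in Python under one common model. The sketch assumes a two-parameter logistic (2PL) model, in which item information is a² · P · (1 - P) and the SEE is approximated by one over the square root of the summed item information; the chapter does not commit to a specific IRT model, and the item parameters and function names below are hypothetical.

```python
from math import exp, sqrt


def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)


def standard_error(theta, items):
    """Approximate SEE at theta: 1 / sqrt(sum of item information)."""
    total = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / sqrt(total)


# Hypothetical (discrimination, difficulty) pairs for a short five-item test.
items = [(1.2, -1.0), (1.0, -0.5), (1.5, 0.0), (1.0, 0.5), (1.2, 1.0)]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta = {theta:+.1f}: SEE = {standard_error(theta, items):.2f}")
```

Because the item difficulties cluster around zero, the SEE is smallest for examinees of middling ability and grows toward the extremes, in contrast to the single SEM of classical test theory.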

Validity. While reliability is a key feature of any test, validity is paramount. The issue of
validity has been treated extensively by others (e.g., American Educational Research Association 
et al., 1999; Baker, 2013; Kane, 1992, 2006; Messick, 1989; Mislevy, 2007, 2009). While it is 
beyond the scope of this chapter to provide a detailed explication, a few highlights are warranted. 
Prior evaluations of validity, in practice, were traditionally addressed primarily after a test was 
constructed. Test items and forms were created from blueprints, such as the matrix of 
dimensions described above, most often without any explicit cognitive theory or framework in 
mind (Mislevy, 2006, 2008). That the blueprint was considered by experts as descriptive of the 
domain, and that the items aligned with the blueprint, constituted an evaluation of content 
validity. Once the forms were assembled and piloted in a field test, various aspects of validity 
could be investigated statistically such as concurrent and predictive validity, dimensionality 
analyses, and in rare cases, consequential validity. 

Criterion-related and predictive strength remain a high priority in establishing valid 
inferences from test scores, especially for tests used in large-scale, high stakes settings. 
However, in this traditional approach, less attention was often paid to the theoretical and 
empirical evidence for the construct (Baker, 2013; Messick, 1989). To the extent that theory 
influenced item and test design, that theory was often in the test developer’s head, not in a more 
explicit set of claims set out in a predefined framework to be evaluated empirically. Using
principled item and assessment development methods helps fill the void of strictly
empirically-driven test construction.

Conceptions of validity now emphasize the importance of constructing assessment 
arguments consisting of claims, and evidence in support of those claims, which may be evaluated
using measurement techniques (Baker, 2013; Kane, 1992, 2006; Messick, 1989; Mislevy, 2006,
2008; Mislevy & Sabatini, 2012; Shephard, 2013). Mislevy & Sabatini (2012) note that the
argument framework for assessment provides tools that go beyond traditional measurement 
approaches to validity, stating: 
“The key is that the roles of psychological perspectives, evaluation procedures, 
and task features—all absent from the measurement framework—are now explicit 
in assessment argument structures, to be articulated with measurement 
machinery” (p. 121). 
The goal is to validate the inferences made from test scores for specific purposes, uses, and target 
populations. This contrasts with the older practice of thinking of the validity of the test itself, 
independent of the scores, uses, or inferences drawn. This evidence trail may include the results 
of analyses typically done after the construction of a test, but more often begins much earlier 
during the design process. Evidence-centered design is a process developed to build assessments 
on cognitive and empirical evidence that enhances the claims of a validity argument as a 
consequence of a systematically conducted design process, as well as empirical field test data 
and analyses (Mislevy & Haertel, 2006; Mislevy, Steinberg, & Almond, 2003). 

In summary, validity is not a property of the test itself; nor is it something that should be
investigated only after a test has been built, but rather should be infused in all phases of 
assessment development. Even after a test has been built and has been shown to have adequate 
psychometric properties, evidence should be collected and accumulated over time to support 
specific claims about test score use. In the remainder of the chapter, we describe innovations in 
assessment design and in psychometric analysis and modeling that are opening up new types and
applications of reading assessments.

Opportunities and Challenges in Enhancing Comprehension Assessments

The purpose or use of assessment results drives the interpretation of scores and should 
drive the construction of the assessment instrument itself (Mislevy, 2006, 2009; Mislevy & 
Haertel, 2006). Table 1 provides a typology of typical purposes or uses of assessment 
information in schools as associated with comprehension (though many of these types certainly 
also apply to other subject areas such as math and science). The table is roughly ordered from 
top to bottom with respect to when in the instructional program the assessment would 
characteristically and logically be administered, as well as the typical level of inference for the 
scores. For example, one would expect to screen students for pre-existing barriers to learning or 
place them into a level in an instructional program before starting the program; while one would 
administer outcome testing after students have completed a program. Formative and monitoring 
assessments logically occur during the learning program. We excluded from the table some 
special case assessment purposes, including selection, certification (typically used with 
professionals, such as teaching certifications), and referrals (such as evidence used to refer an 
individual for special education services). We note that requiring students to pass high school 
graduation tests is also a special case of outcome assessment, with higher stakes. 


Applied comprehension assessment in middle and high school contexts 

Although outcome assessments and other high stakes tests are abundant in middle and 
high schools, use of assessment before and during instruction in these settings is limited, 
even though some instrument options with demonstrated reliability and validity currently exist for 
addressing screening, progress monitoring, and other formative assessment purposes. The Center 
on Response to Intervention website (http://www.rti4success.org/) is a good resource for finding 
instruments that have been reviewed by a Technical Review Committee of experts for technical 
rigor and use. Most assessments reviewed by the Center utilize curriculum-based measurement 
(CBM) with demonstration of use only up to grade 6 or 8, although a few computer-adaptive 
(i.e., IRT-based) assessments of reading comprehension up to grade 10 or grade 12 are available 
(e.g., Renaissance Learning’s STAR and NWEA’s Measures of Academic Progress). The 
measurement strengths and weaknesses of CBMs are described elsewhere (see Christ & Hintze, 
2007) and further advancement of computer adaptive testing is discussed later in this chapter. 

A majority of the research literature exploring assessment before and during instruction 
lies in the response to intervention (RtI) literature (e.g., Christo, 2005; Compton, Fuchs, Fuchs, 
& Bryant, 2006; Fuchs, Compton, Fuchs, Bryant, & Davis, 2008; Klingner & Edwards, 2006; 
O’Reilly, Sabatini, Bruce, Pillarisetti, & McCormick, 2012). Although there is some support for 
RtI assessment practices in middle and high schools, their use in elementary schools has 
undergone more rigorous evaluation (Jimerson, Burns, & VanDerHeyden, 2007). Barriers 
inherent to secondary settings tend to limit rigorous study with this population (Fuchs, Fuchs, & 
Compton, 2010). 

Fuchs et al. (2010) point out three considerations unique to secondary settings that have 
implications for the uses of RtI-style assessments. First, screening assessments may be less 
critical, as students in need of intervention have mostly been previously identified. Second, 
since the gap in achievement may be very large, outcome assessments need a sufficient floor. 
One broad example of problems secondary schools face with inferences from data is highlighted 
by Fuchs et al.’s (2010) third consideration. Elementary schools use screening, diagnostic, and/or 
curriculum-embedded measures to match students to effective interventions. The increasingly 
broad range of skills involved in reading comprehension among struggling middle and high school 
students, and the dilution of responsibility for teaching certain skills in secondary settings, make it 
more challenging to match students to instruction and intervention appropriately. Without 
additional diagnostic assessment, effects from matched instruction may be limited. 

In addition, the systematic review of data in secondary settings is impeded by a relative 
lack of “structured occasions to turn assessment information into actionable knowledge” 
(Halverson, 2010, p. 133). Regularly scheduled team meetings where educators discuss 
instructional decisions based on data are one way to systematically ensure that assessment data are 
used appropriately for their designed purpose. Clarity in the claims, inferences, purposes, 
and uses that assessment scores are intended to serve is the first step in addressing the multiple 
constraints that any applied assessment situation may entail. 


Psychometric Advances 

IRT & MIRT. Earlier in the chapter, we noted two specific utilities of IRT relative to 
classical test theory. First, IRT places items and individuals on the same metric, such that the 
likelihood of correctly answering an item can be related to varying levels of ability scores. 
Second, it relaxes classical test theory constraints on equal measurement error to allow for 
individual precision estimates of ability scores. In addition, IRT has several other virtues 
that help to address complex measurement issues, including invariance (Embretson & 
Reise, 2000; Messick, 1983), equating, and resolving multidimensional constructs. 
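
The two utilities noted above can be illustrated with a minimal sketch, assuming a two-parameter logistic (2PL) model. The item parameters below are hypothetical and are not taken from any assessment discussed in this chapter. The first function relates the probability of a correct response to the distance between examinee ability and item difficulty on a shared metric; the last shows that the precision of an ability estimate depends on where the examinee falls on that metric.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer.

    theta = examinee ability, a = item discrimination, b = item difficulty.
    Ability and difficulty are expressed on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information that one 2PL item contributes at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Conditional standard error of the ability estimate at theta."""
    total = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total)


# Hypothetical (discrimination, difficulty) pairs for a five-item test.
ITEMS = [(1.2, -1.0), (1.0, -0.5), (1.5, 0.0), (0.8, 0.5), (1.1, 1.0)]

if __name__ == "__main__":
    for theta in (-3.0, 0.0, 3.0):
        print(theta, round(standard_error(theta, ITEMS), 3))
```

Under these assumed parameters the standard error is smallest near the middle of the scale, where most items are targeted, and larger at the extremes. Classical test theory, by contrast, reports a single standard error of measurement for all examinees.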

Invariance in classical test theory depends on two assumptions: item parameters are 
statistically equivalent across different groups of individuals and the ability of the individuals is 
statistically equivalent across a set of items. Despite the importance of these assumptions to 
classical test theory, they are easily and frequently violated. A lack of item invariance across 
different groups of individuals precludes meaningful comparisons in total test scores.⁴ Suppose 
that two classrooms’ vocabulary ability is being measured, and a list of twenty words is 
developed to split across the two classrooms. The equality of students’ scores is dependent on the 
equality, or invariance, of the item difficulty. Conversely, suppose that the same list of twenty 
words is given to two separate classrooms, one which has a high incidence of students eligible 
for free/reduced priced lunch, and another which has low incidence of free/reduced priced lunch. 
It is likely that the difficulties of the items will vary between classrooms. In both instances it is 
difficult to make meaningful interpretations of the resulting scores because they are confounded 
by item difficulty differences in the first example, and student ability differences in the second 
example. IRT overcomes such limitations because its theory rests on the idea that item 
parameters are not dependent on the sample; they are a property of the item. Thus, while an item 
with an IRT difficulty of 0 (i.e., average difficulty) will potentially be harder for the classroom 
with a high incidence of students eligible for free/reduced priced lunch compared to low 
incidence classrooms, the difficulty of the item remains approximately the same between the 
classrooms. 
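
The distinction can be made concrete with a small sketch, assuming a 2PL model in which the item parameters are known and fixed. The two ability lists are hypothetical classrooms, not data from any study cited here. The classical item difficulty (the proportion answering correctly) shifts with the group tested, while the IRT difficulty parameter is, by assumption of the model, the same value for both groups.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def classical_p_value(abilities, a, b):
    """Expected proportion correct in a group (classical item difficulty)."""
    return sum(p_correct(t, a, b) for t in abilities) / len(abilities)


# One item with hypothetical parameters: discrimination 1.0, difficulty 0.0.
A, B = 1.0, 0.0

# Hypothetical ability values for two classrooms on the same IRT scale.
lower_group = [-1.5, -1.0, -0.5, 0.0, 0.5]
higher_group = [-0.5, 0.0, 0.5, 1.0, 1.5]

p_low = classical_p_value(lower_group, A, B)
p_high = classical_p_value(higher_group, A, B)

if __name__ == "__main__":
    # The same item looks "hard" in one room and "easy" in the other,
    # although its IRT difficulty B is identical in both computations.
    print(round(p_low, 3), round(p_high, 3))
```

In practice item parameters are estimated rather than known, so this invariance holds only to the extent that the model fits the data and the scales have been linked.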

A related concern is equating. Because the assumption of item invariance is often 
violated, it is necessary to adjust scores such that a total test score based on a set of items means 
the same thing as another set of items from a parallel form. Several methodological designs are 
available in classical test theory (e.g., single group, common item nonequivalent group, and 
random group) as are multiple statistical procedures for converting scores (e.g., mean equating, 
linear equating, equipercentile equating; Kolen & Brennan, 2004). A limitation of these equating 
methods is that they are useful for adjusting scores for a group of examinees, but not for each individual 
(Livingston, 2004). IRT overcomes such limitations by using multiple-group item characteristic 
curve and test characteristic curve (Stocking & Lord, 1983) methodologies. These analyses are 
also used in the previously mentioned methodological designs for equating, but are especially 
useful from a theoretical perspective, when tests vary in the difficulty of items or the groups vary 
in ability. Further, IRT equating does not require extreme scores at the tails of the distribution in 
order to provide a meaningful translation of scores, and it requires fewer steps in execution when 
the items are on the same scale. 

4 In classical test theory, methods of equating test forms are used to address these kinds of problems. 
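
Two of the classical procedures named above, mean equating and linear equating, can be sketched as follows. The sketch assumes a random groups design, in which equivalent groups of examinees take Form X and Form Y; the score lists are hypothetical. Mean equating shifts a Form X score by the difference in form means, while linear equating also rescales by the ratio of the standard deviations.

```python
from statistics import mean, pstdev


def mean_equate(x, form_x_scores, form_y_scores):
    """Place a Form X score on the Form Y scale by matching the means."""
    return x - mean(form_x_scores) + mean(form_y_scores)


def linear_equate(x, form_x_scores, form_y_scores):
    """Place a Form X score on the Form Y scale by matching means and SDs."""
    slope = pstdev(form_y_scores) / pstdev(form_x_scores)
    return mean(form_y_scores) + slope * (x - mean(form_x_scores))


# Hypothetical total scores from two equivalent groups of examinees.
FORM_X = [8, 10, 12, 14, 16]
FORM_Y = [11, 14, 17, 20, 23]

if __name__ == "__main__":
    print(mean_equate(14, FORM_X, FORM_Y))
    print(linear_equate(14, FORM_X, FORM_Y))
```

Both functions convert scores using group-level summaries only, which is consistent with the limitation noted above: the conversion is defined for the group of examinees rather than tailored to each individual.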

Dimensionality with complex structure. Notwithstanding the numerous benefits IRT 
maintains over classical test theory, a particular challenge surrounds assessing and addressing the 
assumption of unidimensionality of item responses. While measuring a singular construct is 
desirable, there are many instances which may preclude a unidimensional construct from 
emerging. The breadth of the construct being measured, the nature of item stimuli, the number of 
items written to reflect each dimension, and the knowledge required to complete the task each 
have bearing on the extent to which a test of unidimensionality yields a best fitting model for a 
single construct. Several statistical methods exist by which dimensionality can be evaluated, 
including exploratory and confirmatory factor analyses, which may be estimated using parametric 
and non-parametric methods (Kim, Zhang, & Stout, 1995; Stout, Douglas, Junker, & Roussos, 
1993; Tate, 2002). Yet even with these options, a key question is how to resolve complex 
dimensionality issues. To guide the remainder of this discussion on IRT, we put forth the 
following scenario and discuss three possible solutions. 

Suppose that a researcher has developed a new assessment of reading comprehension, 
which comprises two reading comprehension passages, one informational and the other 
narrative. Each passage has ten questions which require 
the reader to identify the main idea of the passage, draw an inference from the text, distinguish 
between fact and fiction in the passage, evaluate textual evidence to support conclusions, and 
demonstrate definitional knowledge of textual vocabulary. 

The most common method for evaluating student ability on this type of assessment is to 
simply sum the scores of the twenty items as a representation of reading comprehension ability. 
Figure 1a represents this process, which assumes that the scores are indeed unidimensional. 
While this approach is convenient, several other models may provide better fit to the data. 
Because each passage has ten items, it is plausible that the variances are best captured by two 
related factors: one for the informational passage items, and one for the narrative passage items. 
Shown in Figure 1b, this perspective fits in the framework of a multidimensional factor 
analysis with two correlated latent factors, one for each passage. More 
specifically, we refer to the model in Figure 1b as a multidimensional item response theory (MIRT) 
model when the items are categorical; that is, a non-linear, multidimensional, 
confirmatory factor analysis. 

MIRT models have gained popularity in recent years (Reckase, 1997), as they are able to 
capture distinct, yet related processes which influence item responses. Under circumstances 
where a correlated factor model yields the most appropriate fit to the data, it suggests that the 
processes used to answer questions for one construct, such as the informational passage, may 
also underlie or contribute to performance on the other construct (i.e., the narrative passage). In 
MIRT terminology, this is known as a compensatory item response model, because high ability 
in one domain provides useful information in understanding the performance on the second, 
correlated construct. At a broad, theoretical level, a compensatory MIRT model is no different 
from a logistic regression with multiple predictors. For any given value of one independent 
variable, the probability of Y=1 will vary given a value on a second independent variable. It is 
possible that a low value on variable 1 and high value on variable 2 yields the same probability 
of Y=1 as a high value on variable 1 and low value on variable 2. Thus, in the MIRT model, 
higher ability on one construct can compensate for lower ability on another construct. A primary 
question when fitting a MIRT model is the extent to which a correlated construct model, while 
fitting better than a unidimensional model, provides information on construct relevant skills. If 
not, then perhaps revisiting the assessment design might be most appropriate. 
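
The logistic regression analogy can be written out directly. The following is a minimal sketch of a compensatory two-dimensional item response function; the discriminations, intercept, and ability values are hypothetical and chosen only so that the compensation is easy to see.

```python
import math


def mirt_probability(theta, a, d):
    """Compensatory MIRT item response function.

    theta = tuple of abilities (one per dimension),
    a     = tuple of item discriminations (one per dimension),
    d     = item intercept.
    The abilities enter as a weighted sum, exactly as predictors do in a
    logistic regression, so a deficit on one dimension can be offset by a
    surplus on another.
    """
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical item that draws equally on both dimensions.
A = (1.0, 1.0)
D = 0.0

# Low on dimension 1 and high on dimension 2, and the reverse.
low_high = mirt_probability((-1.0, 1.5), A, D)
high_low = mirt_probability((1.5, -1.0), A, D)

if __name__ == "__main__":
    print(round(low_high, 4), round(high_low, 4))
```

Because the item weights the two dimensions equally, both ability profiles produce the same weighted sum and therefore the same probability of a correct response. With unequal discriminations the trade-off would no longer be one for one.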

An alternative multidimensional specification for the data in this illustration is a bi-factor 
model (Figure 1c). Bi-factor models seek to explain item correlations with a general factor of 
what is believed to be measured by the item responses, along with two or more specific factors 
which model the residual item variance not captured by the general factor. In the present 
example, a general factor of reading comprehension would best represent item variation across 
all twenty items, while two specific factors would represent the residual variance which could be 
differentially attributed to features of the narrative and informational passages. 
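
The three specifications for the twenty-item illustration differ in which items are allowed to load on which factors. The sketch below writes each specification as a loading pattern, where 1 marks a freely estimated loading and 0 a loading fixed to zero. The ordering of items (informational passage first) is an assumption made for the example, and the patterns describe model structure only, not estimated values.

```python
N_ITEMS = 20
# Assumed item order: ten informational items followed by ten narrative items.
PASSAGE = ["informational"] * 10 + ["narrative"] * 10


def unidimensional_pattern():
    """One column: every item loads on a single comprehension factor."""
    return [[1] for _ in range(N_ITEMS)]


def correlated_pattern():
    """Two columns: each item loads only on the factor for its own passage."""
    return [[int(p == "informational"), int(p == "narrative")] for p in PASSAGE]


def bifactor_pattern():
    """Three columns: a general factor plus one specific factor per passage."""
    return [[1] + row for row in correlated_pattern()]


def factors_per_item(pattern):
    """Number of factors each item loads on."""
    return [sum(row) for row in pattern]


if __name__ == "__main__":
    print(factors_per_item(correlated_pattern()))
    print(factors_per_item(bifactor_pattern()))
```

In the correlated-factor pattern every item loads on exactly one factor, whereas in the bi-factor pattern every item loads on two: the general factor and the specific factor for its passage.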

In summary, a wide range of techniques is available for modeling the dimensionality 
of assessments, thus relaxing some of the constraints that the assumption of unidimensionality 
may have imposed on the design. These techniques help in designing assessments that are 
theoretically sound and more useful in applied settings. 

Local item independence. Just as bi-factor models are useful in resolving dimensionality 
issues, they also have applicability to modeling violations of local item independence. The 
concept that the likelihood of an item response is independent of responses to other items has 
been closely linked to the assumption of unidimensionality (Stout, 1990), yet our presentation 
here is concerned with how to manage such violations. As we noted earlier, local item 
dependency (LID) often occurs in traditional tests of reading comprehension. One of the most 
frequently used methods to identify LID is Yen’s Q3 statistic (Yen, 1984), which is the 
correlation between two items after accounting for overall test performance; the larger the 
correlation, the greater the presence of LID. While this procedure is useful in identifying where 
LID may exist for an assessment, it does not explain why it might have occurred. 
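
The computation behind Q3 can be sketched as follows, assuming that a 2PL model has already been fitted so that ability estimates and item parameters are available. For each item, the residual is the observed response minus the model-implied probability; Q3 for a pair of items is the correlation of those residuals across examinees. All data and parameters below are hypothetical, and items 0 and 1 are deliberately given identical responses to mimic strong dependence within a passage.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses, thetas, items, i, j):
    """Correlation of model residuals for items i and j.

    responses[p][k] is the 0/1 score of examinee p on item k,
    thetas[p] is that examinee's ability estimate,
    items[k] is the (discrimination, difficulty) pair for item k.
    """
    res_i = [r[i] - p_correct(t, *items[i]) for r, t in zip(responses, thetas)]
    res_j = [r[j] - p_correct(t, *items[j]) for r, t in zip(responses, thetas)]
    return pearson(res_i, res_j)


# Hypothetical ability estimates, item parameters, and scored responses.
THETAS = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ITEMS = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
RESPONSES = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]

q3_dependent = q3(RESPONSES, THETAS, ITEMS, 0, 1)
q3_other = q3(RESPONSES, THETAS, ITEMS, 0, 2)
```

With real data, Q3 would be computed for every item pair and the values inspected for clusters of high residual correlation, for example among items attached to the same passage.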

LID tends to occur when items are grouped under a shared stimulus, such as a reading 
comprehension passage, or a word problem in math, and we can term such groupings, or bundles 
of items, testlets (Wainer, Bradlow, & Wang, 2007). The presence of LID would be expected to 
be higher within each testlet (e.g., the narrative or informational passage), than across testlets; 
thus, we could model the impact of LID via a bi-factor model. In this case, the bi-factor model is 
used to estimate the difficulty and discrimination of the items. The specific factors are identical to 
those in the dimensionality example, where one factor is modeled for each passage, but the 
evaluation of the model is focused on how well the items are estimated on the general factor of 
reading comprehension. By using the bi-factor model for LID, an analyst can simultaneously 
evaluate the presence of LID via the specific factor variances and obtain item parameters 
and examinee ability scores that are adjusted for testlet effects. 

While bi-factor models are increasingly used as a method for handling dimensionality and 
local item dependency in reading data (Kieffer & Petscher, 2013; Petscher, 2011; Rijmen, 2011; 
Yovanoff & Tindal, 2007), there are several limitations worth noting. Compared to the correlated 
factor multidimensional model shown in Figure 1b, which has a simple structure (i.e., each item 
loads on one factor), the bi-factor model in Figure 1c represents a complex structure, whereby 
each item loads on more than one factor. The bi-factor model estimates more 
parameters; thus, more examinees are required to ensure that item parameters are free from bias. 
Relatedly, the complexity of the model is such that it often takes longer to converge and may 
need more carefully chosen starting values compared to other model specifications. 

Scaling and estimation. A natural query which may emerge after having read through 
the prior sections might be, “Is there a tangible benefit to implementing such complex models?” 
After all, testing the models described here is helpful for methodologists and statisticians, but to 
what extent do such models assist in understanding student performance on the assessment? The 
answer is that there is a benefit. Selecting the appropriate factor model (i.e., unidimensional, simple 
multidimensional, or complex multidimensional), estimation model for item parameters (e.g., 
Rasch model or 2-parameter logistic model), and estimator (e.g., maximum likelihood or weighted 
least squares) are necessary steps in placing scores on a common scale (Gorin & Mislevy, 
2013; Tong & Kolen, 2010). A common scale is critical so that scores can be used to track 
growth within and across academic years for individual students, and is important for ensuring 
that normative scores reflect accurate population achievement. Moreover, common scales are 
critical for selecting cut scores in standard setting such as the Bookmarking (Lewis, Green, 
Mitzel, Baum, & Patz, 1998) or Modified Angoff procedures. Further, when scores are 
empirically used to set benchmarks for interim assessments to make screening decisions, a 
common scale is critical to the process of ensuring that identification procedures are well 
validated. In sum, complex modeling creates scales and score estimates that align to specific 
purposes or uses and thus enhance the validity of the inferences made from those scores. 


Envisioning the future of reading assessment 

Traditional tests have been widely criticized for failing to incorporate the cognitive and 
learning science literature in designs (Mislevy, 2006, 2008; Pellegrino, Chudowsky, & Glaser, 
2001; Snow & Lohman, 1989). Early attempts at opening up the design space, such as the 
performance assessments of the 1990s, met with significant challenges concerning construct 
coverage, objectivity, and consistency of scoring, cost-effectiveness, and time-efficiency 
(Gearhart & Herman, 1998; Kafer, 2002; Koretz, Stecher, Klein, & McCaffrey, 1994a, 1994b; 
Koretz, Stecher, Klein, McCaffrey, & Deibert, 1993). Thus, their feasibility and utility were 
rightfully questioned. 

However, several concurrent forces are changing the equation concerning what is feasible 
and useful. Specifically, the migration of so much of the educational (and reading literacy) 
construct domain to digital forms; the availability and sophistication of technology-based 
delivery and scoring platforms; and advances in measurement techniques are ushering in a new 
world of possibilities for assessment of any kind and especially for reading literacy (see O’Reilly 
& Sabatini, 2013; Sabatini, Albro, & O’Reilly, 2012; Sabatini, O’Reilly, & Albro, 2012; Sabatini, 
O’Reilly, & Deane, 2013; Sabatini & O’Reilly, 2013). Although the constraints described above 
still operate, there are new solutions for addressing and optimizing assessment designs to meet 
the constraints. 


The Call for a New Generation of Reading Assessments 

Previously, we discussed the foundational concepts that led to the development of 
traditional assessments. We framed that discussion in terms of the balancing act between the 
definition of the construct, the purpose of the assessment, the particular needs of the end users, 
and the constraints imposed by logistical, psychometric, economic, and practical issues. Despite 
these challenges, however, advances in technology and, in particular, changes in theoretical, 
political, and social attitudes have begun to reshape how we think about assessment. 

In recent years, a number of scholarly reforms have been proposed to argue for a new 
kind of assessment. Most notably, these include the Common Core State Standards (National 
Governors Association Center for Best Practices, & Council of Chief State School Officers, 
2010), the associated Race to the Top Funding (U.S. Department of Education, 2009), and the 
major consortia, the Smarter Balanced Assessment Consortium and the Partnership for 
Assessment of Readiness for College and Careers. The movement also includes other progressive 
frameworks and standards such as the Partnership for 21st Century Skills (2004, 2008); panels and 
commissions on assessment reform (Gordon Commission, 2013); assessment reform initiatives 
at major testing companies (Bennett, 2011b; Bennett & Gitomer, 2009); framework innovations 
in international assessments of reading such as PISA (Organisation for Economic Co-operation 
and Development [OECD], 2009a), PIAAC (OECD, 2009b), PIRLS (Mullis, Martin, Kennedy, 
Trong, & Sainsbury, 2009), and ePIRLS (International Association for the Evaluation of 
Educational Achievement, 2013a,b); and various publications on assessment reform (e.g., 
Pellegrino et al., 2001). 

Collectively, these efforts call for a new generation of reading literacy assessments that 
reflect a broader conceptualization of the construct that goes beyond what traditional assessments 
have been designed to measure. In particular, these construct features include, but are not 
limited to: purpose-driven or goal-directed comprehension (McCrudden & Schraw, 2007; van 
den Broek, Lorch, Linderholm, & Gustafson, 2001), multiple text comprehension (Britt & Rouet, 
2012; Gil, Bråten, Vidal-Abarca, & Strømsø, 2010; Goldman, 2004), disciplinary and content 
area reading (Goldman, 2012; Lee & Spratley, 2010; Shanahan & Shanahan, 2008; Shanahan, 
Shanahan, & Misischia, 2011), digital literacy, online reading or reading in technological 
environments (Coiro, 2009, 2011; Leu, Kinzer, Coiro, Castek, & Henry, 2013), and social 
interaction including collaboration and communication (NGACBP & CCSSO, 2010; Partnership 
for 21st Century Skills, 2004, 2008). 


What Might These New Assessments Look Like? 

Although there is great enthusiasm for progressive assessment reform, instantiating these 
ideas in a feasible, practical, and sound manner is not without challenges. For instance, while 
there is a growing research base in many of the areas described above, the cognitive and learning 
science literatures are new, and many of these areas have not been investigated with the 
primary purpose of designing valid and reliable assessments; most extant research is focused 
on either basic research or the design of learning and instruction. In order for the pieces to fit 
together, a coherent synthesis of the literature needs to be constructed with assessment 
considerations and constraints in mind. That is, fragmented and separate literatures need to be 
integrated into coherent assessment frameworks. The frameworks, in turn, would be used to 
design items, tasks, and test forms. Then, associated claims can be formulated during the design 
process, and evaluated during and after test construction on the basis of cumulative evidence. 

At the international level, several innovative reading frameworks have been developed 
including the aforementioned PISA (OECD, 2009a), PIRLS (Mullis et al., 2009), ePIRLS 
(IAEE, 2013a, b), and PIAAC (OECD, 2009b). Collectively, these large-scale frameworks have 
been modernized to reflect issues such as multiple text understanding, digital and online reading, 
and even collaborative problem solving (OECD, 2013). Interested readers are encouraged to 
consult the reading frameworks of the national and international reading assessments. 

Although the international assessments described above are innovative, they still have to 
work under a host of practical and operational constraints. As such, many “riskier” design 
features may have to wait for future administrations. So what will the future of reading 
assessment look like in 5-10 years? Predicting the future is always difficult, but it might be 
useful to look at some examples of large scale research projects that are currently underway. 

The first is an ongoing research project that began in 2007 called Cognitively Based 
Assessment of, for, and as Learning, or CBAL for short (Bennett, 2011a, b; Bennett & Gitomer, 
2009).⁵ CBAL is an innovative approach to assessment in K-12 settings and has been developing 
assessments in the English Language Arts (ELA), mathematics, and science. The CBAL ELA 
competency model, akin to an assessment framework (Deane, Sabatini, & O'Reilly, 2012), is 
based on a synthesis of the literature of reading, writing, thinking, and their connections. 
Multiple prototype ELA summative and formative assessments have been developed and 
evaluated (Bennett, 2011b). A key goal of CBAL is to integrate the research in the learning 
sciences to improve construct coverage and make the assessments meaningful for instruction.⁶ 

A similar research project, the Reading for Understanding (RfU) initiative, was funded 
by the Institute of Education Sciences (Institute of Education Sciences, 2010). The purpose of 
this large-scale initiative is to improve reading outcomes through both intervention and 
assessment. Relevant to the current chapter is the work of the assessment team (see ETS, 2013), 
which includes research partners at multiple universities including Florida State University, 
Northern Illinois University, and the University of Arizona. The assessment team is charged 
with developing innovative assessments of reading comprehension and component skills for 
students in preK-12 settings. Key to this effort was the integration of the theoretical and 
empirical literature in the learning sciences including the areas of reading comprehension, 
reading components, reading strategies, measurement, metacognition and self-regulation, 
motivation, and the general cognitive science literature (O’Reilly & Sabatini, 2013; Sabatini & 
O’Reilly, 2013; Sabatini et al., 2013). 

5 Interested readers should visit the CBAL website at: http://www.ets.org/research/topics/cbal/initiative 

6 Due to space limitations, we only elaborate on the RfU assessment project in the paper. Both CBAL and RfU share 
many of the same underlying principles and both incorporate innovative design techniques including scenario-based 
tasks and assessments. 

The confluence of findings from this body of work has informed the development of a 
reading framework that guides the design of items, tasks, and forms for multiple assessments 
developed under the RfU initiative, most notably, an assessment called the Global, Integrated 
Scenario-based Assessment (GISA). Moreover, specific findings from the National Reading 
Panel (National Institute of Child Health and Human Development, 2000) and the National Early 
Literacy Panel (Eunice Kennedy Shriver National Institute of Child Health and Human 
Development, NIH, DHHS, 2010), as well as the reading framework developed by Sabatini & 
O’Reilly (2013) guided the development of a component skills assessment called the FCRR 
Reading Assessment (FRA; Foorman, Petscher, & Schatschneider, 2013) and SARA (Sabatini, 
Bruce, & Steinberg, 2013). For the goals of this chapter, we present a broad discussion of the 
purposes of each type of assessment. The GISA has been developed, in part, from the standpoint 
of construct coverage and supporting learning, while the FRA, a computer adaptive test 
(CAT), is focused on time efficiency. The following sections on the two assessments 
underscore the point that the different designs represent different ways of balancing purposes and 
constraints. In the cases below, the different assessments can be used to serve complementary 
goals (for empirical studies, see O’Reilly et al., 2012; Mislevy & Sabatini, 2012; Sabatini, 
O’Reilly, Halderman, & Bruce, 2014). 

GISA. GISA designs are guided by a three-part framework. The first part of the 
framework outlines six principles for assessment design that were derived from the literature 
(Sabatini & O’Reilly, 2013). While some of the principles discuss empirical and theoretical 
issues, such as vocabulary, that are already covered on many existing reading tests, other 
principles cover issues that are not routinely addressed, such as goal-directed reading (or 
task-oriented reading), multiple source integration, and digital literacy. The second part of the 
framework provides a definition of reading, a position on development, the constructs to be 
assessed, and the two assessments designed to measure reading comprehension (Sabatini, 
O’Reilly & Deane, 2013). In brief, reading comprehension is described as the set of knowledge, 
skills, and dispositions that enable readers to construct meaning from text. In particular, five 
dimensions of reading literacy are described: the writing (or print) system, language (or verbal) 
system, text and discourse, conceptual modeling/reasoning, and social modeling/reasoning. 
These dimensions serve as analytic categories for decomposing literacy tasks, such that one can 
describe or evaluate the relative contribution of skills necessary to perform the task successfully. 

GISA utilizes several features that are not routinely found in existing off-the-shelf 
reading assessments (O’Reilly & Sabatini, 2013). These features include: the use of 
scenario-based assessment; task designs that model and support evidence-based instructional practice; the 
use of simulated peers; and the inclusion of performance moderators in the design. These ideas are 
briefly summarized below. 

In many traditional reading assessments, test takers are presented with a collection of 
unrelated passages on a range of general topics. Students answer a set of discrete items on each 
passage and then move on to an unrelated passage. In this traditional design, students are 
effectively expected to “forget” what they read previously when answering questions on later 
passages. In other words, there is no overarching purpose for reading other than to answer 
discrete multiple choice questions (Rupp et al., 2006). In contrast to this approach, the GISA 
uses a scenario-based assessment approach to shape the way passages, tasks, and items are 
processed. 

In a scenario-based assessment, students are given an overarching purpose for reading a 
collection of thematically related sources for the purposes of solving problems, making 
decisions, or completing a higher level task (e.g., make a presentation; edit a wiki). The reading 
purpose sets up a collection of goals, learning aims, or criteria that students use to evaluate 
sources, or decide what information is relevant. The collection of sources is often diverse and 
may include a selection from a textbook, e-mails, blogs, websites, policy documents, primary 
historical documents, and so forth. Students are asked a series of questions about the sources 
ranging from traditional comprehension items (locate information, vocabulary, basic inference) 
to more complex tasks such as the synthesis and integration of multiple texts, perspective taking, 
evaluating web search results, completing graphic organizers, using a rubric to score given 
responses, or applying what they read to a new situation or context. 

Tasks and activities in a scenario are sequenced to reveal what parts of a more complex 
task students can or cannot do. For instance, if a student has trouble writing a summary, thus 
limiting the evidence of their skills, other tasks are provided to determine whether the student 
can recognize a good summary, evaluate a given summary, complete a graphic organizer, or 
identify key ideas. Such a collection of graded tasks helps provide an evidence trail that can be 
used to infer the complexity of tasks a particular student can handle. In this way, complex tasks 
are not viewed as an “all or none activity”, but rather as a way to help triangulate partial student 
knowledge in the larger context of development. Simulated “peer” students are also included 
in the assessment design to provide guidance and hints, and to serve as a way to identify student 
misconceptions or errors in understanding. For instance, a simulated peer may provide an 
incorrect explanation of a process described in a text, and the test taker’s task is to identify and 
correct the error. 

Other techniques are often incorporated in the test design to provide more information 
about test takers, including their level of background knowledge on the topic of assessment, or 
their level of engagement and motivation. In tandem, these “performance moderators” can be 
used to help interpret test scores. For instance, if a measure of background knowledge indicates 
that the student knew a lot about the topic, then the score could be qualified as possibly reflecting 
more about the student’s knowledge level than their reading ability per se. In a similar vein, if 
measures of engagement indicate that the student was not putting their best effort forward, then 
the score might be qualified as not reflecting the student’s true reading ability. Other 
performance moderators, such as metacognition and self-regulation, as well as reading strategies, 
are included in the test design to model and encourage good practice. 

To illustrate these ideas, imagine a scenario in which students are asked whether hybrid 
cars are environmentally friendly. Before they read any texts, they are given a background 
knowledge test on related topics such as gasoline automobiles, hybrid cars, electricity, batteries, 
and so forth. Students are then given a preliminary set of passages that help build up their 
general understanding of what a hybrid car is and how it works. Successive sources outline the 
potential benefits (e.g., less fuel consumption, fewer emissions and pollutants released into the 
atmosphere) of hybrid cars, while other texts discuss potential problems (e.g., higher cost of the 
vehicles, environmental impact of discarding the batteries). Students are asked to evaluate the 
credibility of the sources (Do the sources have a monetary stake accompanying their position?), 
as well as the reasoning and soundness of the arguments (Do the arguments go off on a tangent? 
Are source authors trying to convince by emotional appeal rather than a logical argument with 
supporting evidence?). Simulated peers might incorrectly summarize the texts or draw 
inappropriate inferences, and the test taker is asked to correct the summary or inferences, as 
supported by text evidence. Test takers might then be asked to make a brochure outlining the 
key issues on both sides of the argument and draw conclusions based on the available evidence. 

The scenario-based assessment described above is designed to reflect the way an 
individual might interact with and use literacy source material better than traditional, 
decontextualized assessments do. It presents real problems and issues for students to solve, and it 
involves the use of higher level reading and reasoning skills that are demanded by many current 
initiatives. Despite these more demanding goals, the assessment also presents students an 
opportunity to develop their skills, as complex tasks are broken down into more manageable 
subtasks, while empirically supported practices, such as metacognition and reading strategies, are 
incorporated into the design. In this way, the assessment represents an opportunity to support 
learning, in addition to the more traditional uses of measuring what was previously learned (in terms 
of content assessment) or understood during the assessment (reading assessment). Although the 
innovations described above are still in their infancy, preliminary data indicate they are feasible 
and worth considering as new technology and data emerge. Although every assessment 
must work within a set of constraints such as those described earlier in the chapter, evolution in 
design and in technology can often be integrated into a manageable but innovative design space. 

FRA: A Computer Adaptive Test. Time can often be a limiting factor, as many 
assessments use a static form with a fixed set of items in a predetermined order. The item pool 
often consists of items that span a range of difficulty, yet most items in a static assessment tend 
to be of moderate difficulty, with relatively few easy or hard items included. This means that 
for a given group of individuals, low ability students will confront moderate or hard items that 
are too difficult relative to their ability (hence yielding little information), and high ability students 
will spend less time confronting items that are at their challenge level (hence yielding less 
information about their proficiency). A result of this assessment structure is that high 
performing and low performing students receive less reliable scores, as well as inefficient tests of 
their abilities. 
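
The loss of precision at the extremes can be illustrated numerically. The sketch below is not drawn from any operational assessment. It assumes a Rasch (one-parameter logistic) IRT model, under which an item of difficulty b contributes Fisher information P(1 - P) at ability theta, and a hypothetical 20-item static form (`STATIC_FORM`) whose difficulties cluster in the middle of the scale. All names and values are illustrative assumptions.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the (assumed) Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information that an item of difficulty b provides at ability theta."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)


def standard_error(theta, difficulties):
    """Asymptotic standard error of the ability estimate for a fixed set of items."""
    info = sum(item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)


# Hypothetical static form: 20 items, all of moderate difficulty (logit scale).
STATIC_FORM = [-0.5, -0.25, 0.0, 0.25, 0.5] * 4

if __name__ == "__main__":
    for theta in (-3.0, 0.0, 3.0):
        se = standard_error(theta, STATIC_FORM)
        print(f"ability {theta:+.1f}: standard error = {se:.2f}")
```

Under these assumed values, the standard error for examinees three logits above or below the center of the scale is more than twice the standard error for an examinee at the center, even though all three take the same 20 items.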

Recent innovations in psychometric and technological research, known as computer 
adaptive testing (CAT), allow for assessments to be more dynamic than many traditional forms 
that use a fixed set of items in a predetermined order. The intricacies of a CAT have been 
discussed at length in various sources (e.g., Thompson & Weiss, 2011; van der Linden & Glas, 
2010; Wainer, Dorans, Flaugher, Green, & Mislevy, 2000; Wise & Kingsbury, 2000), but the 
essential operations occur in the following four-step process: 1) the examinee is administered an 
item whose difficulty is optimally matched to their ability; 2) the examinee responds to the 
item; 3) the ability score is estimated; and 4) steps 1-3 repeat until the examinee meets one of 
several possible termination criteria established by the test developer (e.g., has an ability score 
with a standard error less than some value, or has taken the maximum allowed number of items). 
CATs can reduce testing time, with some estimates of the reduction as high as 50% (Weiss, 1982; Weiss & 
Kingsbury, 1984), while maintaining strong reliability for most participants. Three particular 
benefits of CAT hold great promise for the next generation of assessments, and are emerging as 
important applications in education: 1) accounting for item dependency, 2) accounting for item 
response lag, and 3) empirical classification of students via item performance. 
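
The four-step process described above can be sketched in code. This is a minimal illustration under stated assumptions, not the algorithm of any particular CAT: it assumes a Rasch model, selects the unused item whose difficulty is closest to the current ability estimate, and re-estimates ability as the posterior mean over a grid with a standard normal prior. The names `run_cat`, `eap_estimate`, `se_target`, and `max_items`, and the simulated examinee, are hypothetical.

```python
import math
import random


def rasch_prob(theta, b):
    """Probability of a correct response under the (assumed) Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Grid of candidate ability values from -4.0 to 4.0 in steps of 0.1.
GRID = [g / 10.0 for g in range(-40, 41)]


def eap_estimate(responses):
    """Return (posterior mean, posterior SD) of ability given (difficulty, correct) pairs."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # standard normal prior (unnormalized)
        for b, correct in responses:
            p = rasch_prob(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(pool, answer_item, se_target=0.4, max_items=30):
    """Administer items adaptively; return (ability estimate, standard error, items used)."""
    remaining = list(pool)
    responses = []
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    # Step 4: repeat until a termination criterion is met.
    while remaining and len(responses) < max_items and se > se_target:
        # Step 1: pick the unused item whose difficulty best matches the estimate.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        # Step 2: the examinee responds to the item.
        correct = answer_item(b)
        responses.append((b, correct))
        # Step 3: update the ability estimate and its standard error.
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    rng = random.Random(0)
    true_theta = 1.0  # simulated examinee, for illustration only
    pool = [d / 10.0 for d in range(-30, 31)]  # 61 items from -3.0 to 3.0
    theta, se, n = run_cat(pool, lambda b: rng.random() < rasch_prob(true_theta, b))
    print(f"estimate = {theta:.2f}, standard error = {se:.2f}, items used = {n}")
```

The two stopping rules in the loop condition correspond to the termination criteria mentioned in the text: a standard error below a chosen value, or a maximum number of items.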

CAT can help improve the reliability of scores for all participants by taking into 
consideration the ability estimate of the student. The underlying concept of a CAT is that 
students should be optimally matched to items, rather than forced to take items which are too 
difficult or too easy relative to their ability level. Because CAT is rooted in IRT, computer 
algorithms are able to search an item pool and continually locate items which are closely 
matched to a person’s ability. Recall that a hallmark of IRT is that the difficulty of the item and 
the ability of the person are both estimated and are on the same metric. In this way, CAT creates 
individual tests customized to the ability of the individual; low ability examinees will tend to 
receive easier items and high ability students will receive more difficult items. 

While CAT has several advantages over static assessments, there are some potential 
drawbacks. One potential concern is construct coverage. If items are optimized to the ability 
level of the student, a particular test taker may not receive items that cover key aspects of the 
construct. This may be acceptable under the assumption of unidimensionality of the construct, in 
that any item might be considered indicative of overall ability. However, this assumption may 
be limiting if one wants to be assured that a variety of tasks representing a complex construct are 
attempted by the examinee. Furthermore, in some states, legislative measures require that all test 
takers take the same assessment. In a literal sense, CAT produces a different test for different 
groups of students. In any event, CAT continues to be an innovative way to help maintain 
reliability in light of time pressure and efficiency concerns, as illustrated in the following 
description of the FRA. 

FRA. The development process of the computer adaptive FCRR Reading Assessment 
(FRA) carefully balances recent understanding of the critical constructs of reading development 
across the school years, multiple approaches to improving the efficiency of test items and 
calculation of scores, and translation of those scores to teachable skills in the classroom from 
pre-K to grade 12. Similar to the GISA, the FRA views reading comprehension as a complex, 
multidimensional construct. Depending on their developmental level, students may be assessed on 
a variety of reading component skills, including alphabet 
knowledge, phonological awareness, word reading, vocabulary, listening comprehension, 
spelling, syntax, and reading comprehension. The FRA overlaps with many off-the-shelf measures 
of reading, but it differs in that it is delivered in a computer adaptive environment. This allows 
students to receive fewer items in each substantive area without being frustrated by items that are 
poorly matched to their ability. 

Construct measurement in the FRA is focused on narrow, teachable aspects of the 
intended constructs. For example, vocabulary is thought to be multidimensional 
(receptive/expressive); however, measuring the skill more globally or comprehensively 
has historically required establishing a basal and ceiling in both receptive and expressive areas. 
Achieving a reliable and valid score requires many items and takes time away from instruction. 
As such, given the state of research on a subskill like vocabulary, which suggests the correlation 
between receptive and expressive skills is moderate to large, the FRA is focused on measuring 
receptive vocabulary skills in a CAT framework. What may be lost by not measuring expressive 
skills is gained in the efficiency and precision with which we can provide reliable diagnostic 
information on receptive vocabulary. In this way the teacher is able to evaluate vocabulary 
ability as measured by the FRA and determine if further instruction, intervention, or depth in 
diagnostic profiling within a skill is necessary. 

The statistical models used in the FRA are designed to leverage the correlations among 
the constructs as potential sources of information. By using cross-construct information, it is 
possible to obtain information about an examinee’s ability in a particular reading skill by 
measuring a different skill. Under circumstances where such models fit, the FRA leverages the 
information that, for example, knowledge of letter sounds might contribute to understanding 
student ability in a correlated trait such as phonological awareness. 
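
As a rough illustration of why a correlated skill carries information, and not as a description of the FRA's actual statistical models, assume two abilities, \theta_1 and \theta_2, are standardized and jointly normal with correlation \rho. Knowing \theta_1 then narrows the distribution of \theta_2. The value \rho = 0.7 below is a hypothetical example.

```latex
\begin{align}
\theta_2 \mid \theta_1 &\sim \mathcal{N}\!\left(\rho\,\theta_1,\; 1 - \rho^2\right) \\
\rho = 0.7:\quad \operatorname{Var}(\theta_2 \mid \theta_1) &= 1 - 0.49 = 0.51
\end{align}
```

Under these assumptions, measuring the first skill alone removes about half of the prior uncertainty (variance) about the second, before any item targeting the second skill is administered.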

In addition to the enhanced precision, reliability, and efficiency of the FRA, its scores are 
more readily usable by teachers. The tasks in the FRA were deliberately chosen to answer 
specific questions in modern educational practice and to more intuitively guide appropriate 
instructional decision-making. For example, ability scores were chosen because teachers and 
other educators typically ask if students are progressing in their targeted reading skills. The 
ability score gives a precise and reliable estimate of a student’s ability without the 
equivalent-forms problems of more traditional assessments. An important practical utility of the FRA is that 
it gives scores for teachable skills (e.g., Syntactic Knowledge and manipulating word parts in the 
Vocabulary Knowledge task) that are aligned to highly emphasized, standards-based instruction 
(i.e., Common Core State Standards). 

Conclusion 

The goal of this chapter has been to provide a review of assessment design and analytic 
practices, which can be used to contextualize the implications of innovations in reading 
comprehension assessments. We have discussed how assessments reflect a balance of purposes 
and constraints that guide the development of tasks, items, and test forms. More specifically, we 
reviewed how construct definition, standardization, and cost and efficiency help shape and 
constrain practical, reliable, and feasible tests. We also reviewed key issues in measurement and 
psychometrics including classical test theory, unidimensionality, item independence, reliability, 
validity, and item response theory, and how they contribute to test construction and the inferences 
that can be made from test scores. 

Given this foundational review, we also discussed the future of reading assessment by 
drawing on recent innovations in measurement and cognitive theory. We provided examples of 
two complementary assessments that are designed to be used in tandem to provide a broader 
picture of reading achievement. In closing, we note that innovation is relative to the time period 
in which it was conceived. We anticipate future advances in theory and technology will continue 
to transform what were once considered constraints into opportunities for test designers to 
enhance the value and utility of comprehension assessments in applied settings. 

Table 1 


Purposes or Uses of Comprehension Tests 

Purpose               When typically administered   Example use cases                             Typical level(s) of inference

Screening             Before instructional program  Identify individual students at risk in       Individual
                      begins                        traditional classroom curriculum and
                                                    instruction for potential other services or
                                                    programs.

Placement             Before instructional program  Place individual students into different      Individual
                      begins                        levels or groups in a program.

Diagnostic            Before (or as indicated       Evaluate specific individual strengths and    Individual
                      based on other info)          weaknesses that may be relevant to
                                                    instructional objectives, intensity, or
                                                    duration.

Formative Assessment  During: Daily, as             Make day-to-day instructional decisions;      Individual, group,
                      appropriate                   provide actionable information for            instruction, or classroom
                                                    teachers or students.

Monitoring/Benchmark  During: At appropriate        Evaluate whether instruction is working       Individual, group,
                      intervals                     towards outcome.                              instruction, or classroom

Outcome               After instructional program   Provide accountability / program              Individual, group,
                      delivered                     improvement information.                      instructional program,
                                                                                                  classroom, school, system

Figure 1. Graphical representation of (a) unidimensional model, (b) multidimensional correlated factors model, (c) multidimensional 
bi-factor model, and (d) multidimensional second-order factor model. 

Endnotes 

i It is worth noting that there are several assumptions made about the errors in classical test 
theory (Kline, 1999). First, it is expected that T and E are uncorrelated, meaning that an 
individual’s errors, whether negative or positive, will not maintain a systematic relation with the 
true score. Second, it is expected that an error score on one form of the assessment (e.g., the 
three reading comprehension passages) will be uncorrelated with the error on a parallel form of 
the assessment (e.g., a set of three different reading comprehension passages). Third, it is 
expected that the errors are normally distributed, with the average of the random errors around 
the individual’s true score being zero. This means that at times the reading comprehension score may 
be high, such as when the student has particularly high self-efficacy or recalls the 
information well from a prior testing, or low, such as when the student skipped breakfast, but 
because the random errors are assumed to be normally distributed around zero, the average error 
across testing periods is expected to be zero. 

ii It follows then that if tests are strictly parallel, we can replace the covariance of true scores T 
and T', COV(T,T'), by the variance of true scores V(T), and the CTT assumption of 
uncorrelated errors COV(E,E') = 0 = COV(T,E') gives us what we need. 
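
The steps in this endnote can be written out explicitly. The sketch below assumes that the result referred to is the standard CTT identity that the correlation between strictly parallel forms equals the ratio of true-score variance to observed-score variance. The symbols X = T + E and X' = T' + E' for the observed scores, and \rho_{XX'} for their correlation, are notation introduced here for illustration. It also uses COV(E,T') = 0, which follows from the same uncorrelated-errors assumption, and V(X) = V(X') for parallel forms.

```latex
\begin{align}
\mathrm{COV}(X, X') &= \mathrm{COV}(T + E,\; T' + E') \\
  &= \mathrm{COV}(T, T') + \mathrm{COV}(T, E') + \mathrm{COV}(E, T') + \mathrm{COV}(E, E') \\
  &= V(T) + 0 + 0 + 0 = V(T), \\
\rho_{XX'} &= \frac{\mathrm{COV}(X, X')}{\sqrt{V(X)\,V(X')}} = \frac{V(T)}{V(X)}.
\end{align}
```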

iii Technically, IRT models do not contain an error variable as a component of the model 
equations. They are based on a probability model for item-level variables and assume a latent 
variable. The standard error in IRT models is based on assumptions we make about the model 
and on what is known as the Fisher information inequality, or Cramér-Rao lower bound. 


